{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9FqeieOC5vUB"
      },
      "source": [
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/qianniucity/llm_notebooks/blob/main/notebooks/Evaluating_the_Ideal_Chunk_Size_for_a_RAG_System_using_LlamaIndex.ipynb\">\n",
        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
        "</a>\n",
        "\n",
        "# 使用LlamaIndex的响应评估模块确定RAG系统的最佳块大小"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NIvSXj365r2n"
      },
      "source": [
        "# **前言**\n",
        "\n",
        "检索增强生成（RAG）引入了一种创新的方法，将搜索系统的广泛检索能力与 LLM（Large Language Model）相结合。在实施 RAG 系统时，一个关键的参数决定了系统的效率和性能，那就是 `chunk_size`（块大小）。如何确定无缝检索的最佳块大小呢？这就是 LlamaIndex 的 `Response Evaluation`（响应评估）派上用场的地方。在本博客文章中，我们将指导您通过使用 LlamaIndex 的 `Response Evaluation` 模块来确定最佳的 `chunk size`。如果您对 `Response Evaluation` 模块不熟悉，我们建议在继续之前先查看其[文档](https://docs.llamaindex.ai/en/latest/core_modules/supporting_modules/evaluation/modules.html) 。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dpbtWrEa53Ct"
      },
      "source": [
        "## **为什么块大小很重要**\n",
        "\n",
        "选择正确的 `chunk_size` 是一个关键决策，它可以从多个方面影响 RAG 系统的效率和准确性：\n",
        "\n",
        "1. **相关性与粒度**：小的 `chunk_size`，如128，会产生更细粒度的块。然而，这种细粒度可能带来风险：如果 `similarity_top_k` 设置为 2 这样的限制性，那么关键信息可能不在最顶部检索到的块中。相反，块大小为 512 很可能包含所有必要的信息在顶部的块内，确保查询的答案随时可用。为了解决这个问题，我们采用了忠实度和相关性指标。这些指标分别基于查询和检索的上下文来衡量‘幻觉’的缺失和‘响应’的相关性。\n",
        "2. **响应生成时间**：随着 `chunk_size` 的增加，输入到 LLM 中以生成答案的信息量也会增加。虽然这可以确保更全面的上下文，但它也可能减慢系统的速度。确保增加的深度不会损害系统的响应性至关重要。\n",
        "\n",
        "总的来说，确定最佳的 `chunk_size` 是关于达到平衡：捕获所有必要的信息而不牺牲速度。对于各种大小进行彻底的测试以找到适合特定用例和数据集的配置是非常重要的。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SR8jlf3358_z"
      },
      "source": [
        "## **设置**\n",
        "\n",
        "在进行实验之前，我们需要确保导入了所有必要的模块："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ItNWVKRRD67j"
      },
      "outputs": [],
      "source": [
        "%pip install llama-index pypdf"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y9SVm76h58de"
      },
      "outputs": [],
      "source": [
        "import nest_asyncio\n",
        "\n",
        "nest_asyncio.apply()\n",
        "\n",
        "from llama_index import (\n",
        "    SimpleDirectoryReader,\n",
        "    VectorStoreIndex,\n",
        "    ServiceContext,\n",
        ")\n",
        "from llama_index.evaluation import (\n",
        "    DatasetGenerator,\n",
        "    FaithfulnessEvaluator,\n",
        "    RelevancyEvaluator\n",
        ")\n",
        "from llama_index.llms import OpenAI\n",
        "\n",
        "import openai\n",
        "import time\n",
        "openai.api_key = 'OPENAI-API-KEY' # set your openai api key"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SvEzZzif6G5O"
      },
      "source": [
        "## **下载数据集**\n",
        "\n",
        "我们将在这个实验中使用2021年的Uber 10K SEC文件。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZOD_9THEErrc"
      },
      "outputs": [],
      "source": [
        "%mkdir -p 'data/10k/'\n",
        "%wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bO21UssT6L8N"
      },
      "source": [
        "## **加载数据**\n",
        "\n",
        "\n",
        "让我们加载我们的文档。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "x6QdEBd-17OC"
      },
      "outputs": [],
      "source": [
        "# Load Data\n",
        "\n",
        "reader = SimpleDirectoryReader(\"./data/10k/\")\n",
        "documents = reader.load_data()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jnpPtiz56TYA"
      },
      "source": [
        "## **问题生成**\n",
        "\n",
        "为了选择合适的 `chunk_size`，我们将计算不同 `chunk_sizes` 的平均响应时间、忠实度和相关性等指标。`DatasetGenerator` 将帮助我们从文档中生成问题。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "26BgDF3L6Z0r"
      },
      "outputs": [],
      "source": [
        "# To evaluate for each chunk size, we will first generate a set of 40 questions from first 20 pages.\n",
        "eval_documents = documents[:20]\n",
        "data_generator = DatasetGenerator.from_documents()\n",
        "eval_questions = data_generator.generate_questions_from_nodes(num = 40)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C3WwA-0N6dMO"
      },
      "source": [
        "## **设置评估器**\n",
        "\n",
        "我们正在设置 GPT-4 模型作为评估实验期间生成的响应的基础。两个评估器，`FaithfulnessEvaluator` 和 `RelevancyEvaluator`，都使用 `service_context` 进行初始化。\n",
        "\n",
        "1. **忠实度评估器** - 它有助于衡量响应是否是虚构的，并测量查询引擎的响应是否与任何源节点匹配。\n",
        "2. **相关性评估器** - 它有助于衡量响应是否实际回答了查询，并测量响应 + 源节点是否与查询匹配。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "G2LoMRtr6fnG"
      },
      "outputs": [],
      "source": [
        "# We will use GPT-4 for evaluating the responses\n",
        "gpt4 = OpenAI(temperature=0, model=\"gpt-4\")\n",
        "\n",
        "# Define service context for GPT-4 for evaluation\n",
        "service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)\n",
        "\n",
        "# Define Faithfulness and Relevancy Evaluators which are based on GPT-4\n",
        "faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)\n",
        "relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UUncIIxR6gVz"
      },
      "source": [
        "## **块大小的响应评估**\n",
        "\n",
        "我们根据 3 个指标评估每个 chunk_size：\n",
        "\n",
        "1. 平均响应时间。\n",
        "2. 平均可信度。\n",
        "3. 平均相关性。\n",
        "\n",
        "这是一个函数，`evaluate_response_time_and_accuracy`，它的作用是：\n",
        "\n",
        "1. 矢量索引创建。\n",
        "2. 构建查询引擎。\n",
        "3. 指标计算。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dEC2Lr0z6p1N"
      },
      "outputs": [],
      "source": [
        "# Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size\n",
        "# We use GPT-3.5-Turbo to generate response and GPT-4 to evaluate it.\n",
        "def evaluate_response_time_and_accuracy(chunk_size, eval_questions):\n",
        "    \"\"\"\n",
        "    Evaluate the average response time, faithfulness, and relevancy of responses generated by GPT-3.5-turbo for a given chunk size.\n",
        "\n",
        "    Parameters:\n",
        "    chunk_size (int): The size of data chunks being processed.\n",
        "\n",
        "    Returns:\n",
        "    tuple: A tuple containing the average response time, faithfulness, and relevancy metrics.\n",
        "    \"\"\"\n",
        "\n",
        "    total_response_time = 0\n",
        "    total_faithfulness = 0\n",
        "    total_relevancy = 0\n",
        "\n",
        "    # create vector index\n",
        "    llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
        "    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)\n",
        "    vector_index = VectorStoreIndex.from_documents(\n",
        "        eval_documents, service_context=service_context\n",
        "    )\n",
        "    # build query engine\n",
        "    # By default, similarity_top_k is set to 2. To experiment with different values, pass it as an argument to as_query_engine()\n",
        "    query_engine = vector_index.as_query_engine()\n",
        "    num_questions = len(eval_questions)\n",
        "\n",
        "    # Iterate over each question in eval_questions to compute metrics.\n",
        "    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),\n",
        "    # we're using a loop here to specifically measure response time for different chunk sizes.\n",
        "    for question in eval_questions:\n",
        "        start_time = time.time()\n",
        "        response_vector = query_engine.query(question)\n",
        "        elapsed_time = time.time() - start_time\n",
        "\n",
        "        faithfulness_result = faithfulness_gpt4.evaluate_response(\n",
        "            response=response_vector\n",
        "        ).passing\n",
        "\n",
        "        relevancy_result = relevancy_gpt4.evaluate_response(\n",
        "            query=question, response=response_vector\n",
        "        ).passing\n",
        "\n",
        "        total_response_time += elapsed_time\n",
        "        total_faithfulness += faithfulness_result\n",
        "        total_relevancy += relevancy_result\n",
        "\n",
        "    average_response_time = total_response_time / num_questions\n",
        "    average_faithfulness = total_faithfulness / num_questions\n",
        "    average_relevancy = total_relevancy / num_questions\n",
        "\n",
        "    return average_response_time, average_faithfulness, average_relevancy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p8DQvTP96s48"
      },
      "source": [
        "## 测试不同的`chunk_size`\n",
        "\n",
        "我们将评估一组`chunk_size`，来确定哪一个指标最合适。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jlKICwXH6Tib"
      },
      "outputs": [],
      "source": [
        "# 对不同的分块大小进行迭代，以评估指标，帮助确定分块大小。\n",
        "\n",
        "for chunk_size in [128, 256, 512, 1024, 2048]:\n",
        "  avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size)\n",
        "  print(f\"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
